Introduction

The objective of this workshop is to introduce statistics and data analysis using the R programming language. We will start by exploring what R is and to use it as a calculator, then we will analyse the penguins dataset.

The Rstudio interface

Throughout this workshop, we will Rstudio. Rstudio is an integrated development environment, a software that includes everything we need to use R comfortably. Here’s a overview of the interface.


The Rstudio interface you have in front of you is divided in 4 parts:

  1. Upper left: the script area. Here you can write several lines of code. To execute a bit of code written here, either select it and click on Run, or click on the line and press Ctrl + ↵

  2. Lower left: the is the R console. Here can enter single commands. All executed commands (including those ran from the scipt area) will be sent here, and R responses will be displayed here.

  3. Upper right: the environment panel. In this panel. This shows everything that has already been stored in the environment: data, variables, …

  4. Lower right: file browser, plots, help on functions, …


Files and working directory

All the necessary files are included in the folder we’ve provided to you, called StatsWorkshop. We will work inside this folder and define it as our working directory, meaning that everything that we will generate and read will be from this place. Using the Files panel in Rstudio, navigate to inside this folder. Then click on “More” on the top of this panel, then “Set as working directory”. You will notice that a command has appeared in the console panel: clicking on set as working directory has instructed Rstudio to run a setwd command. This is in fact another way to set the working directory, by executing a command setwd("PATH/TO/THE/DIRECTORY"). If we want a reminder of where R is currently working, we can request to know the working directory with the command getwd().


Module 1: R as a calculator and programming basics

Trust me, it can be a bit more impressive than that


In this module, we will see how we can use R as a (quite powerful) calculator and learn some basics of coding in R that will be necessary for the rest of the workshop.

\[ \\[1cm] \]

1.1. Basic Operations

The console of R can be used as a basic calculator. For example, we can type the following commands, and R will compute and display the result.

3+2
## [1] 5
2+3*5
## [1] 17
(2+3)*5
## [1] 25
42*123
## [1] 5166
(15+3)/(143.7-17)
## [1] 0.1420679
2^11
## [1] 2048


R also knows standard function like square root (sqrt), exponential (exp) or base 10 logarithm (log10).

sqrt(125)
## [1] 11.18034
exp(10)
## [1] 22026.47
log10(1000)
## [1] 3

Exercise 1

What are the values of the following?

\[ log_{10}\left(9+{{2+3}\over{10-5}}\right) \]

\[ \sqrt{5^3 + 3 \times 7 - 2} \]

When you are ready to see the answer, click on the “Code” button on the right below this.

\[ \\[1cm] \]

1.2. Variables

R can store values in variables. They are assigned using the symbol <-.

a <- 3
b <- 5
a*b
## [1] 15


Variables can be overwritten:

a <- 42
b <- a-40
a
## [1] 42
b
## [1] 2
a <- 3
a
## [1] 3
b
## [1] 2

Note that in the above example, overwriting a did not change the value of b: R does not go back to rerun older lines of code!


You can name your variables with whatever name you want, with a few exceptions that are protected names. Do not hesitate to give long, clear names to your variables. R is a case-sensitive language, which means that thisvariable, ThisVariable and THISVARIABLE will all be considered to be different names.

Importantly, variables can have different types. Such types include:

  • Integer, e.g. 3

  • Numeric, e.g. -3.14

  • Character, e.g. "This is a character". They are defined between quotes, either "double" or 'simple'.

  • Logical, being either TRUE or FALSE. We will cover these in more details in the next section.

There are other types, but these will be the most frequent ones. Note that not all operations are possible on all types:

character1 <- "Trying to"
character2 <- "add characters."
character1 + character2
## Error in character1 + character2: non-numeric argument to binary operator


Note: if you want to concatenate characters, the function to do so is called paste:

character1 <- "Pasting characters"
character2 <- "works much better."
paste(character1, character2)
## [1] "Pasting characters works much better."


We will often need not only one value, but a set of values. To do so, we can create vectors using c(). The i-th value in the vector v can be accessed using v[i].

v <- c(1,3,5,7,9)
v
## [1] 1 3 5 7 9
v[3]
## [1] 5


Vectors can contain any type of variable. However, they can only contain variables that are from the same type. If you want to group variables of different types, you need to use a structure called lists, but we do not have the time to cover them in this workshop.

Special case: if you want all integer values between two numbers, there is a shortcut: start:end.

numberSequence <- 3:103
numberSequence
##   [1]   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
##  [19]  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38
##  [37]  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56
##  [55]  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74
##  [73]  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92
##  [91]  93  94  95  96  97  98  99 100 101 102 103

\[ \\[1cm] \]

1.3. Logical and ‘if’ statement

We can create some variables that are called logical or booleans. This means that they can take only two values: TRUE or FALSE, sometimes coded as 1 and 0. These names are protected in R so you cannot override them by creating a variable with these names. There are specific operations that we can make on logical variables:

  • x AND y returns TRUE if both x and y are TRUE, returns FALSE otherwise

  • x OR y returns TRUE if at least one of x or y is TRUE, returns FALSE if both x and y are FALSE

  • NOT x returns the opposite of x: FALSE if is TRUE, TRUE if x is FALSE

In R, AND is coded by &, OR is coded by | (above the \ symbol on the keyboard) and not is coded by !.

t <- TRUE
f <- FALSE
t & f
## [1] FALSE
t | f
## [1] TRUE
!t
## [1] FALSE
!f
## [1] TRUE


Exercise 2

with t being defined as TRUE and f as FALSE (as above), what do the following evaluate to?

t AND NOT f

NOT (t OR f)

((NOT t) OR f) AND t

Before you evaluate them in R, try to see if you can figure them out by yourself.

When you are ready to see the answer, click on the “Code” button on the right below this.

We can evaluate whether two variable have the same value with == or if they are different with !=. These operations return logical values:

a <- 3
a == 5
## [1] FALSE
b <- 7
a != b
## [1] TRUE


The logical variables can be used to run bits of code only when a certain condition is met. To do so, we can use the if() else structure:

if(CONDITION){*CODE TO RUN IF THE CONDITION IS MET*} else{*CODE TO RUN IF THE CONDITION IS NOT MET*}

Note that the else{} statement is optional, and we can remove it if we don not need anything to be done if the condition is not met. Here is an example:

a <- 10
if(a==10){print("a is 10")} else{print("a is not 10")}
## [1] "a is 10"
a <- 5
if(a==10){print("a is 10")} else{print("a is not 10")}
## [1] "a is not 10"
a <- 3
if(a > 2){a = a-1}
a
## [1] 2

\[ \\[1cm] \]

1.4. Loops

When we need a part of the code to be run several times, there is no need to copy and paste the same chunk of code several times (imagine if you need to run it 100,000 times!). There are structures called loops. The most frequent one is the for loop, which follows this structure:

for(variable in SET OF VALUES){
  CODE TO BE RUN MULTIPLE TIMES,
  ONE FOR EACH VALUE IN SET OF VALUES
  THIS CODE CAN USE variables
}


Here are a few examples:

for(a in c(1,3,5,7,9)){
  print(a)
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
count <- 0
for(loop in 1:10){
  count <- count +1
}
count
## [1] 10
sentence <- ""
for(word in c("One","word","at","a","time.")){
  sentence <- paste(sentence,word)
}
sentence
## [1] " One word at a time."


Exercise 3

Write a loop that prints every number form 1 to 15 but that replaces 3 and 9 by "foo" and 5 and 10 by "bar", giving the following output:

## [1] 1
## [1] 2
## [1] "foo"
## [1] 4
## [1] "bar"
## [1] 6
## [1] 7
## [1] 8
## [1] "foo"
## [1] "bar"
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15

When you are ready to see the answer, click on the “Code” button on the right below this.

\[ \\[1cm] \]

1.5. Packages

It is possible to expand the functions already available in R or to import new datasets using packages. Packages are sets of functions and/or data written by R users that can be shared and used by anyone.

They can be distributed on different sites, with slightly different installation methods. The easiest way to know is to always visit the package’s webpage to look at the installation instruction.

For packages distributed through CRAN (the organism responsible for R development), you can install packages using install.packages("nameOfThePackage"). During the workshop we will use functions from the packages tidyr, palmerpenguins and factoMineR. To install them, run the following:

install.packages("tidyr")
install.packages("psych")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("palmerpenguins")
install.packages("FactoMineR")


This will install these packages on your computer. They need to be loaded with library() to be usable in the R session:

library(tidyr)
library(palmerpenguins)
library(ggplot2)
library(FactoMineR)

\[ \\[3cm] \]

Module 2: First look at the penguins

Three penguins species! Artwork by @allison_horst


During the rest of the workshop, we will analyze an example dataset, that lists measurements of penguins observed in the Palmer Archipelago (Antarctica) by Dr Kristen Gorman (University of Alaska). For each animal included in the dataset, we have access to:

  • Its species

  • The island it was observed on

  • The length and depth of it bill (see illustration below)

  • Its flipper length

  • Its body mass

  • Its sex

  • The year it was observed

Just in case you are not a professional ornithologist. Artwork by @allison_horst

\[ \\[1cm] \]

2.1. Reading and writing data

The first step to access and analyse a dataset is to load it into the R environment. We provided you with the palmerpenguins dataset as a .csv file. This is a text format to represent a table where cells are separated by a comma. To load data from a .csv file into the R environment, we will use the read.table() function:

penguins <- read.table("PATH/penguins.csv",sep = ",", header = T)

Don’t forget to adjust the path to the file to allow R to find it. the sep argument specifies what was used to separate the table cells. The header parameter tells R that the first line contains headers, i.e. column names.

We stored it in a variable called penguins. Its structure is different from other variables: this a table, called a data frame. Each individual penguin is on a different line, and each column is a different information on all penguins.

To check everything went well, we can have a look at the first few lines of the data frame:

head(penguins)
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 4  Adelie Torgersen             NA            NA                NA          NA
## 5  Adelie Torgersen           36.7          19.3               193        3450
## 6  Adelie Torgersen           39.3          20.6               190        3650
##      sex year
## 1   male 2007
## 2 female 2007
## 3 female 2007
## 4   <NA> 2007
## 5 female 2007
## 6   male 2007

This prints the first few lines. You will notice that some data are labeled as NA. This means that this information is missing, something that happens more often than you would think in real life datasets! R encodes this with a special value NA.

\[ \\[1cm] \]

2.2. Summary of the data

To have a quick look at what is in the dataset, we can use the summary() function:

summary(penguins)
##    species             island          bill_length_mm  bill_depth_mm  
##  Length:344         Length:344         Min.   :32.10   Min.   :13.10  
##  Class :character   Class :character   1st Qu.:39.23   1st Qu.:15.60  
##  Mode  :character   Mode  :character   Median :44.45   Median :17.30  
##                                        Mean   :43.92   Mean   :17.15  
##                                        3rd Qu.:48.50   3rd Qu.:18.70  
##                                        Max.   :59.60   Max.   :21.50  
##                                        NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex                 year     
##  Min.   :172.0     Min.   :2700   Length:344         Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   Class :character   1st Qu.:2007  
##  Median :197.0     Median :4050   Mode  :character   Median :2008  
##  Mean   :200.9     Mean   :4202                      Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                      3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                      Max.   :2009  
##  NA's   :2         NA's   :2


Descriptive statistics may be a long gone memory, so here’s a reminder:

  • Median is the value that splits the set in 2 halves of the same size. For example, the median body weight of our penguins is 4202g. This means that 50% of penguins have a body mass below or equal to 4202g, and 50% have a body mass above or equal to 4202g.

  • 1st and 3rd quartiles are the values that split the set at 25% and 75%, respectively.


To have a more complete summary of the data, we can also use the function describe() from the package psych:

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(penguins)
##                   vars   n    mean     sd  median trimmed    mad    min    max
## species*             1 344    1.92   0.89    2.00    1.90   1.48    1.0    3.0
## island*              2 344    1.66   0.73    2.00    1.58   1.48    1.0    3.0
## bill_length_mm       3 342   43.92   5.46   44.45   43.91   7.04   32.1   59.6
## bill_depth_mm        4 342   17.15   1.97   17.30   17.17   2.22   13.1   21.5
## flipper_length_mm    5 342  200.92  14.06  197.00  200.34  16.31  172.0  231.0
## body_mass_g          6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
## sex*                 7 333    1.50   0.50    2.00    1.51   0.00    1.0    2.0
## year                 8 344 2008.03   0.82 2008.00 2008.04   1.48 2007.0 2009.0
##                    range  skew kurtosis    se
## species*             2.0  0.16    -1.73  0.05
## island*              2.0  0.61    -0.91  0.04
## bill_length_mm      27.5  0.05    -0.89  0.30
## bill_depth_mm        8.4 -0.14    -0.92  0.11
## flipper_length_mm   59.0  0.34    -1.00  0.76
## body_mass_g       3600.0  0.47    -0.74 43.36
## sex*                 1.0 -0.02    -2.01  0.03
## year                 2.0 -0.05    -1.51  0.04


To access one of the columns of a data frame, we can use the symbol $, with this syntax:

dataframe$columnname

penguins$species
##   [1] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##   [7] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [13] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [19] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [25] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [31] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [37] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [43] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [49] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [55] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [61] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [67] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [73] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [79] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [85] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [91] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
##  [97] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
## [103] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
## [109] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
## [115] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
## [121] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
## [127] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
## [133] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
## [139] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
## [145] "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"    "Adelie"   
## [151] "Adelie"    "Adelie"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [157] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [163] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [169] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [175] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [181] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [187] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [193] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [199] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [205] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [211] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [217] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [223] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [229] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [235] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [241] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [247] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [253] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [259] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [265] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [271] "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"    "Gentoo"   
## [277] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [283] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [289] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [295] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [301] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [307] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [313] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [319] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [325] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [331] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [337] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [343] "Chinstrap" "Chinstrap"
penguins$bill_length_mm
##   [1] 39.1 39.5 40.3   NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6
##  [16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5
##  [31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0
##  [46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6
##  [61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5
##  [76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9
##  [91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0 37.7 37.8 37.9
## [106] 39.7 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1
## [121] 36.2 37.7 40.2 41.4 35.2 40.6 38.8 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1
## [136] 41.1 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 36.0 37.8
## [151] 36.0 41.5 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5
## [166] 48.4 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9 46.1 44.5 47.8
## [181] 48.2 50.0 47.3 42.8 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3
## [196] 49.6 50.5 43.6 45.5 50.5 44.9 45.2 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5
## [211] 43.2 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2
## [226] 46.5 46.4 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3
## [241] 47.5 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1 48.5 55.9 47.2
## [256] 49.1 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8
## [271] 47.2   NA 46.8 50.4 45.2 49.9 46.5 50.0 51.3 45.4 52.7 45.2 46.1 51.3 46.0
## [286] 51.3 46.6 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6
## [301] 46.7 52.0 50.5 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9
## [316] 53.5 49.0 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4 45.7 50.7
## [331] 42.5 52.2 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2


If we want to count data, we can use the function table(). We can ask it for counts of one or two variables. The useNA parameter is used to indicate R whether to display the number of NAs. It must be one of "no", "ifany" or "always", to respectively not show them, show them only if there are NAs, or to always show them, even if there are none. If not specified, it will default to “no”. Optionnaly, we can also name the variable we want to count.

table(penguins$sex, useNA = "ifany")
## 
## female   male   <NA> 
##    165    168     11
table(species = penguins$species, island = penguins$island)
##            island
## species     Biscoe Dream Torgersen
##   Adelie        44    56        52
##   Chinstrap      0    68         0
##   Gentoo       124     0         0


Exercise 4

Using the table() function, find the answer to the following:

  1. How many penguins were observed on Dream Island?

  2. How many females were observed in 2008?

  3. How many missing entries (NAs) are there in the island column?

When you are ready to see the answer, click on the “Code” button on the right below this.

\[ \\[1cm] \]

2.3. Subsetting the dataset

Perhaps we don’t want to look at all available penguins but maybe we are only interested in Adelie penguins. For this and many different analyses, we will need to only look at a subet of the dataset. There are several ways to do so. The first one is using the subset() function:

adelie <- subset(penguins, species=="Adelie")
table(species = adelie$species, island = adelie$island)
##         island
## species  Biscoe Dream Torgersen
##   Adelie     44    56        52


If we want to specify several possible value, can use the constructor %in%:

biscoeDream <- subset(penguins, island %in% c("Biscoe","Dream"))
table(species = biscoeDream$species, island = biscoeDream$island)
##            island
## species     Biscoe Dream
##   Adelie        44    56
##   Chinstrap      0    68
##   Gentoo       124     0


We can also put several conditions:

malesTorgersen <- subset(penguins, island == "Torgersen" & sex == "male")
table(sex = malesTorgersen$sex, island = malesTorgersen$island)
##       island
## sex    Torgersen
##   male        23


Another way to subset the data is to use the filter() function from the package dplyr. It works similarly to subset().

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
adelieFemales <- filter(penguins, species == "Adelie", sex == "female")


We can also specify directly which lines and/or columns to keep. To do so, we use the structure dataframe[lines, columns]. If either lines or columns are left empty, this means that we want to keep all lines or columns, respectively. We can specify which lines / columns to keep either by giving a vector of logicals, numbers, or names:

subset <- penguins[penguins$island == "Biscoe", c("species","island","body_mass_g")] #keep only lines that have "Biscoe" as island, and keep only columns "species", "island" and "body_mass_g"
subset <- penguins[25:40,] # keep only lines 25 to 40, and all columns


Another way to subset the data is to use the filter() function from the package dplyr. It works similarly to subset().

library(dplyr)
adelieFemales <- filter(penguins, species == "Adelie", sex == "female")


Exercise 5

Using table() or summary() on a subset of the penguins data frame, find out the answer to the following questions:

  1. What is the mean body weight of female Gentoo penguins?

  2. What is the smallest observed flipper length for Chinstrap penguins in 2007?

  3. How many Adelie penguins weighing less than 4,000 g were observed on Dream Island?

When you are ready to see the answer, click on the “Code” button on the right below this.

\[ \\[1cm] \]

2.4. Descriptive statistics

The summary() function allows to have a quick look at what’s in our dataset, but returns many different output. We can instead directly compute descriptive statistics using the following functions:

  • mean() for the mean

  • median() for the median

  • quantile() for any given quantile (generalization of the quartile concept to other fractions than quarters). We indicate the quantile we want with the parameter probs, expressed between 0 and 1. Note that the median is equivalent to quantile(x, probs = 0.5).

  • sd() for standard deviation, a measure of data variability

By default, all these functions will return NA if the data contains NAs. To ignore NAs and return descriptive statistics of the non-missing data, we can use the parameter na.rm = TRUE.

mean(penguins$body_mass_g)
## [1] NA
mean(penguins$body_mass_g, na.rm = TRUE)
## [1] 4201.754
median(penguins$flipper_length_mm, na.rm = TRUE)
## [1] 197
quantile(penguins$flipper_length_mm, probs = c(0.10,0.25,0.5,0.75,0.90), na.rm = T)
##   10%   25%   50%   75%   90% 
## 185.0 190.0 197.0 213.0 220.9
sd(penguins$bill_length_mm, na.rm = TRUE)
## [1] 5.459584


Before the next exercise, let’s add a new column to the dataset that will contain the bill length to depth ratio:

penguins$bill_ratio <- penguins$bill_length_mm / penguins$bill_depth_mm


Exercise 6

  1. What is the mean and standard deviation of the body mass of Adelie penguins?

  2. What is the median flipper length of female penguin?

  3. What are the quartiles and median bill length to depth ratios for Chinstrap penguins?

When you are ready to see the answer, click on the “Code” button on the right below this.

\[ \\[3cm] \]

Module 3: Data visualisation

Sorry, we won’t get THAT far in data visualisation today :S

To have a better sense of any dataset, plotting some results can be very useful. In this module, we’ll learn how to draw simple graphics using the base R functions. Those are rather rudimentary but are easier to use and still allow some flexibility. If you want to create more refined graphics using R, we encourage you to have a look at what ggplot2 can offer! We’ll give a ggplot2 example for each plot type if you want to learn more.

3.1. Histograms

One of the first graphs we can have a look at are histograms. These allow to have a rapid glance at the distribution of our data. We can draw them using the hist function. We can control the number of bars with breaks, which either takes the numeric values at which to break, or a target number of bars.

hist(penguins$bill_length_mm)

hist(penguins[penguins$species == "Gentoo", "body_mass_g"], breaks = 10)

We’re probably thinking that these plots are fairly ugly. And you’d be right. Let’s improve them. First, we can modify x axis label using the xlab parameter and a title using main.

hist(penguins[penguins$species == "Gentoo", "body_mass_g"], breaks = 10, xlab = "Body mass (g)", main = "Body mass distribution of Gentoo penguins")


Then we can change the grey color to anything we’d like. This is controlled with the col parameter. We can either set a frequent color name such as "red" or "blue", or input the hexadecimal code of the color after #:

hist(penguins[penguins$species == "Gentoo", "body_mass_g"], breaks = 10, xlab = "Body mass (g)", main = "Body mass distribution of Gentoo penguins", col = "#067275")


Exercise 7

Draw a histogram of the flipper length of female penguins, with a proper x axis label, a title and color it anything else than grey.

When you are ready to see the answer, click on the “Code” button on the right below this.

If you want an example of what ggplot2 can offer instead of the base R functions:

library(ggplot2)

ggplot(penguins, mapping = aes(x = body_mass_g, fill = species)) + # draw a ggplot2 figure on the penguins data frame, where the x axis is the body mass, and the colours depend on the species
  geom_histogram(alpha = 0.7, position = "identity", bins = 20) + # draw a histogram, with 30% transparence and overlap them
  labs(title = "Penguins body mass distribution", subtitle = "By species", x = "Body mass (g)", y = "Count") + # Set title and labels
  scale_fill_manual(values = c("#ff7900","#bf5ccb","#067275")) + # specify the colour scheme
  theme_bw() # set the theme
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

\[ \\[1cm] \]

3.2. Scatterplots

Another simple graph type are scatterplots. These are points drawn from their (x,y) coordinates. They can help to see if two numerical variables are associated. Here are some examples:

plot(x = penguins$bill_length_mm, y = penguins$bill_depth_mm)

plot(x = penguins$flipper_length_mm, y = penguins$body_mass_g)


These plots are again not very eye-pleasing. Let’s tweak them. Similarly as above, we can add labels to each axis. Let’s put dots instead of circles with the parameter pch (you can find the whole list of possible values here).Let’s also color the points depending on each penguin’s species. To do so, we will use the parameter color. We need to send him a vector of colors, with one color for each point. The easiest way to do so is the following: col = c("Adelie" = "#ff7900", "Chinstrap" = "#bf5ccb", "Gentoo" = "#067275")[penguins$species]. Finally, using the function legend(), we can add a legend to our plot.

plot(x = penguins$flipper_length_mm, y = penguins$body_mass_g, pch = 16, xlab = "Flipper length (mm)", ylab = "Body mass (g)", col = c("Adelie" = "#ff7900", "Chinstrap" = "#bf5ccb", "Gentoo" = "#067275")[penguins$species])
legend("topleft",legend = c("Adelie","Chinstrap","Gentoo"),col = c("#ff7900","#bf5ccb","#067275"), pch = 16)


Exercise 8

Draw a scatterplot representing the body mass of penguins on the x axis and their bill length on the y axis. Color it depending on the islands, and put different symbols for male and female penguins. Add proper labels, a legend for the islands on the bottom right of the plot and a legend for the sex on the top left.

When you are ready to see the answer, click on the “Code” button on the right below this.

\[ \\[1cm] \]

3.3. Boxplots

To represent the different possible values taken by a variable, we can draw boxplots. Here is a boxplot example for the flipper length of our penguins:

boxplot(penguins$flipper_length_mm)

The large horizontal bar represents the median. The box itself extends from the first to the third quartile. The whiskers try to extend to the min/max but will not extend for more than 1.5 times the interquartile range (the distance between 1st and 3rd quartiles). If some data points are outside of this range (outliers), they are drawn individually.


We can specify that we want different boxes for different modalities associated with a given categorical variable using the symbol ~:

boxplot(penguins$body_mass_g ~ penguins$species)

boxplot(penguins$flipper_length_mm ~ penguins$island)


Once again, we can make the graphs nicer using parameters such as col, xlab, ylab or main. We can also use ylim to manually specify the limits of the y axis:

boxplot(penguins$body_mass_g ~ penguins$island, col = c("darkgreen","darkblue","darkred"), xlab = "Island", ylab = "Body mass (g)", main = "Body mass by island", ylim = c(0,7000))


Exercise 9

Draw a boxplot showing the bill length by sex. Add some labels and colors, and start the y axis at 0.

When you are ready to see the answer, click on the “Code” button on the right below this.

\[ \\[3cm] \]

Module 4: Distributions and statistical tests

4.1. Data preparation to compare group means

We will focus on our penguins data creating subsets of them to see their distribution. We will step-by-step check the “rules” that need to be followed in order to perform different types of tests to check if they are “equal” or not. When we test for “equality” we usually check differences in the mean values of different groups in our dataset. Let’s start by trying to collect the data that will be used in our analysis, by focusing on the penguins from the “Adelie” and “Gentoo” species. We start by subsetting the dataset using filter. You will note in this example a different structure: |> is called a pipe. It tells R to send the object specified before |> as the main argument of the following function. It can be used to make the code clearer to read.

data_to_test <- penguins |>
  filter(species %in% c("Adelie","Gentoo"),
         !is.na(body_mass_g))

Let’s have a quick look of our data using the summary() and describe() functions. What is the output of these functions? What does it mean?

summary(data_to_test)
##    species             island          bill_length_mm  bill_depth_mm  
##  Length:274         Length:274         Min.   :32.10   Min.   :13.10  
##  Class :character   Class :character   1st Qu.:38.35   1st Qu.:15.00  
##  Mode  :character   Mode  :character   Median :42.00   Median :17.00  
##                                        Mean   :42.70   Mean   :16.84  
##                                        3rd Qu.:46.67   3rd Qu.:18.50  
##                                        Max.   :59.60   Max.   :21.50  
##  flipper_length_mm  body_mass_g       sex                 year     
##  Min.   :172.0     Min.   :2850   Length:274         Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3600   Class :character   1st Qu.:2007  
##  Median :198.0     Median :4262   Mode  :character   Median :2008  
##  Mean   :202.2     Mean   :4318                      Mean   :2008  
##  3rd Qu.:215.0     3rd Qu.:4950                      3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                      Max.   :2009  
##    bill_ratio   
##  Min.   :1.640  
##  1st Qu.:2.106  
##  Median :2.323  
##  Mean   :2.594  
##  3rd Qu.:3.137  
##  Max.   :3.613
describe(data_to_test)
##                   vars   n    mean     sd  median trimmed     mad     min
## species*             1 274    1.45   0.50    1.00    1.44    0.00    1.00
## island*              2 274    1.58   0.79    1.00    1.47    0.00    1.00
## bill_length_mm       3 274   42.70   5.20   42.00   42.54    6.23   32.10
## bill_depth_mm        4 274   16.84   2.01   17.00   16.80    2.45   13.10
## flipper_length_mm    5 274  202.18  15.05  198.00  201.84   17.79  172.00
## body_mass_g          6 274 4318.07 835.93 4262.50 4293.98 1000.76 2850.00
## sex*                 7 265    1.51   0.50    2.00    1.51    0.00    1.00
## year                 8 274 2008.04   0.81 2008.00 2008.05    1.48 2007.00
## bill_ratio           9 274    2.59   0.55    2.32    2.58    0.62    1.64
##                       max   range  skew kurtosis    se
## species*             2.00    1.00  0.20    -1.97  0.03
## island*              3.00    2.00  0.89    -0.81  0.05
## bill_length_mm      59.60   27.50  0.30    -0.68  0.31
## bill_depth_mm       21.50    8.40  0.09    -0.97  0.12
## flipper_length_mm  231.00   59.00  0.16    -1.27  0.91
## body_mass_g       6300.00 3450.00  0.22    -1.00 50.50
## sex*                 2.00    1.00 -0.02    -2.01  0.03
## year              2009.00    2.00 -0.08    -1.46  0.05
## bill_ratio           3.61    1.97  0.20    -1.64  0.03

\[ \\[1cm] \]

4.2. The normal distribution

\[ \\[1cm] \]


How can we understand that our data follow the normal distribution as seen above?
Firstly, we can plot a histogram of our data to check how they look. In R, we do the following to draw the histograms of the body mass by species using ggplot2:

ggplot(aes(x=body_mass_g), data = data_to_test) +
  geom_histogram(binwidth = 200) +
  facet_wrap(~species) +
  theme_bw()


It seems that our data, after choosing a right bin size, follow the normal distribution, without major outliers.

Secondly, we need to check the Q-Q (quantile-quantile) plots. In normally distributed data, these plots should fall along the 1:1 line. The most simple way to do it is the following

qqnorm(data_to_test$body_mass_g)
qqline(data_to_test$body_mass_g)


To have a more clear picture we can vizualize two plots for each species.

data_to_test |>
  ggplot(aes(sample = body_mass_g )) + 
  geom_qq() +
  geom_qq_line(colour ='darkred') +
  facet_wrap(~species) +
  theme_bw()


Last but not least, we need to check that the variances of the two groups are equal. If, also the variances are equal, then we can assume that the data follow the normal distribution, and we can perform a parametric test to check for their differences. To do that we will perform the Levene’s test for equality of variances. We will use the levene_test() function from the rstatix package.

library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
# check for equality of variance with Levene's test
data_to_test |> levene_test(body_mass_g ~ species) 
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## # A tibble: 1 × 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1     1   272      1.99 0.159

What is the p-value from the test? Is it less than 0.05? If the p-value is > 0.05, it means that the variances are not significantly different.

\[ \\[1cm] \]

4.3. Parametric tests and when we use them

When we want to compare the difference between two means, we conduct a t-test. In a t-test, we test the initial hypothesis (H0), that the means are equal. Practically, after conducting the t-test, we check the p-value, and if it is < 0.05, we reject the null hypothesis, accepting an alternative hypothesis, that the means are significantly different.

There are several types of t-test, which include:

  • One-sample t-test. Here, we compare the mean of a sample to a known value. In our example, we could compare the body mass of our lovely penguins to the literature’s standard.

  • Unpaired two-sample t-test Here, we compare the mean of two independent groups. In our example, we could compare the mean value of body mass of the male penguins to the female ones. Alternatively, the body mass of the body mass of the Adlie penguins can be compared to the Gentoo penguins.

  • Paired t-test This test is used to compare the mean values of two RELATED groups of samples. For example, it could be used when we compare the average weight of a sample (weight of polar bears) before and after winter.

To compare the difference of means in more than 2 samples (3 and more), the parametric test used is the One way Analysis of Variance, but it will not be covered in this workshop.

In our example, we can conclude that the data follow the normal distribution, therefore we will perform a parametric test to compare the mean value of the body mass in our two groups, Adelie and Gentoo penguins.
In base R we do the following:

 t.test(data_to_test$body_mass_g ~ data_to_test$species)
## 
##  Welch Two Sample t-test
## 
## data:  data_to_test$body_mass_g by data_to_test$species
## t = -23.386, df = 249.64, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Adelie and group Gentoo is not equal to 0
## 95 percent confidence interval:
##  -1491.183 -1259.525
## sample estimates:
## mean in group Adelie mean in group Gentoo 
##             3700.662             5076.016


Using the dplyr package:

 data_to_test |>
   t_test(body_mass_g ~ species)
## # A tibble: 1 × 8
##   .y.         group1 group2    n1    n2 statistic    df        p
## * <chr>       <chr>  <chr>  <int> <int>     <dbl> <dbl>    <dbl>
## 1 body_mass_g Adelie Gentoo   151   123     -23.4  250. 7.71e-65


What do we conclude? Is the body mass of Adelie and Gentoo penguins significantly different?

Let’s create a nice visualization that includes the graphs we created when we checked for normality, together with boxplots that show the body mass of our two groups.

p1 <- ggplot(aes(x=body_mass_g), data = data_to_test) +
  geom_histogram(binwidth = 200) +
  facet_wrap(~species) +
  theme_bw()

p2 <- data_to_test |>
  ggplot(aes(sample = body_mass_g )) + 
  geom_qq() +
  geom_qq_line(colour ='darkred') +
  facet_wrap(~species) +
  theme_bw()

p3 <- data_to_test |>
  ggplot(aes(x = species,
             y = body_mass_g)) +
  geom_boxplot(aes(fill = species)) +
  scale_fill_manual(values = c("darkorange","cyan4")) +
  geom_jitter( alpha =0.5) +
  theme_bw()


library(patchwork)
p1/ p2 | p3


The previous was an example of an unpaired two-sample t-test.
But let’s also perform a one sample t-test where we will compare the body mass of penguins, regardless the island the come from, to the literature’s standard body mass of penguins in Antarctica, which is 3700kg.

t.test(penguins$body_mass_g, mu = 3700, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  penguins$body_mass_g
## t = 11.571, df = 341, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 3700
## 95 percent confidence interval:
##  4116.458 4287.050
## sample estimates:
## mean of x 
##  4201.754

Does the body mass of our sample’s data significantly differ from the literature’s standard value? If, instead of 3700, we compare the body mass of the penguins’ dataset to 4200, what would be your conclusion? Would it be the same? \[ \\[1cm] \]

4.4. Non-parametric tests and when we use them

When our data are not “normal” (normally distributed), instead of a parametric test we use a non-parametric alternative. More specifically, a non-parametric test does not assume anything about the underlying distribution of our data.

In our example, in case we haven’t performed a ‘normality’ check, we can perform a Mann-Whitney U test, which is the non-parametric alternative to the unpaired t-test.

wilcox.test(data_to_test$body_mass_g ~ data_to_test$species, paired = FALSE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data_to_test$body_mass_g by data_to_test$species
## W = 400.5, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

\[ \\[1cm] \]

Exercise 10

Check if the the flipper length of female penguins is different from the male ones. Don’t forget to first check for normality!

When you are ready to see the answer, click on the “Code” button on the right below this.


What happens when we are comparing between more than 2 groups? For parametric testing, we would perform one-way analysis of variance (ANOVA), which we will not cover during this workshop. For non-parametric test, we will use the Kruskal-Wallis test. For instance, let’s see if the bill length is significantly different between islands:

kruskal.test(penguins$bill_length_mm ~ penguins$island)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  penguins$bill_length_mm by penguins$island
## Kruskal-Wallis chi-squared = 52.438, df = 2, p-value = 4.103e-12


Since the p-value is < 0.05, we can conclude that the bill length of penguins is significantly different depending on the islands. We can visualise the data as well:

penguins |>
  ggplot(aes(x = island,
             y = bill_length_mm)) +
  geom_boxplot(aes(fill = island)) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Bill length by island", x = "Island", y = "Bill length (mm)") +
  theme_bw()
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

Exercise 11

Using a non-parametric test, determine if the flipper length is significantly between the three species. Either using the base boxplot function or using ggplot2, draw a boxplot to visualise the data.

When you are ready to see the answer, click on the “Code” button on the right below this.

\[ \\[1cm] \]

4.5. When do you use which test?

The following table will help you to decide which statistical test to use based on your data:

n of Groups to compareParametric TestNon-parametric test
Two (unpaired)Un-paired t-testMann-Whitney
Two (paired)Paired t-testWilcoxon Rank sum
Three or moreOne way Analysis of VarianceKruskal-Wallis

\[ \\[3cm] \]

Module 5: Principal Components Analysis and Clustering

5.1. Principal Component Analysis

PCA (Principal Component Analysis) is a dimensionality-reducction method, which is used to reduce the dimensionality of very large datasets.
It transforms a large set of variables in to a smaller one that retains most of the information of a complex dataset.

penguins <- read.table("penguins.csv",sep = ",", header = T)
penguins <- na.omit(penguins)
#subset data with numerical values to be used in the PCA analysis.
# One way to do this, is by using the subset function
PCA.data <-  subset(penguins, select = c("bill_length_mm","bill_depth_mm","flipper_length_mm" ,"body_mass_g"))

#another way to do  it is by using the dplyr package
library(dplyr)
PCA.dplyr <- penguins |> select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

#Next step is to remove any NA values
#PCA.data = na.omit(PCA.data)
PCA.dplyr <- na.omit(PCA.dplyr)

#next we will perform the PCA analysis using the prcomp function, and summarise the results
PCA.results <- prcomp(PCA.data, center = TRUE, scale = TRUE)
summary(PCA.results)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.6569 0.8821 0.60716 0.32846
## Proportion of Variance 0.6863 0.1945 0.09216 0.02697
## Cumulative Proportion  0.6863 0.8809 0.97303 1.00000
plot(PCA.results)

biplot(PCA.results)

plot(PCA.results$x[,1], PCA.results$x[,2])

#plot results
library(ggfortify)
autoplot(PCA.results)

autoplot(PCA.results, data = penguins, colour = 'species')

#another way to perform the PCA analysis is by using functions from the factorextra and factorminer packages
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
PCA.results2 <- PCA(PCA.data, graph = T)

fviz_screeplot(PCA.results2)

\[ \\[1cm] \]

5.2. What is clustering?

Consider clustering a task in which data points are divided into a number of groups such as the points of a group are more similar to each other than the ones that participate in different groups.
In other words, clustering seperates points/groups with similar attributes, from the ones that have different characteristics and assigns them into clusters.
Various types of clustering include:

  1. Centroid-based clustering (k-means)

  2. Density-based clustering

  3. Distribution-based clustering

  4. Hierarchical clustering

\[ \\[1cm] \]

5.3. k-means clustering

k-means clustering is considered one of the simplest clustering algorithms, that groups together certain data with a member of that certain group than with another group. “k” is the number of groups that we cluster our data points, while “mean” refers to the center of the data in that group.

#prepare the data for which we want to perform the clustering
#k.data = penguins |> na.omit(penguins) |> select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, species) |> scale()
penguins.clean <- na.omit(penguins)

#run k-means clustering for different k's and visualise it

k3 <- kmeans(penguins.clean[,3:6], centers = 3)

# put en extra column at the k.data, which will show the cluster it is assigned
k3.penguinData <- data.frame(cluster = k3$cluster, penguins.clean)

#In which cluster to the different species belong
ggplot(k3.penguinData, aes(x = cluster)) +
  geom_bar(aes(fill = species)) +
  theme_bw()

ggplot(k3.penguinData, aes(x = cluster)) +
  geom_bar(aes(fill = island)) +
  theme_bw()

k2 <- kmeans(penguins.clean[,3:6], centers = 2)
k4 <- kmeans(penguins.clean[,3:6], centers = 4)

# we will use factoextra package to compute the ideal number of k  and visualize the clusters

plot_k3 <- fviz_cluster(k3, geom = "point", data = penguins.clean[,3:6])
plot_k2 <- fviz_cluster(k2, geom = "point", data = penguins.clean[,3:6])
plot_k4 <-  fviz_cluster(k4, geom = "point", data = penguins.clean[,3:6])
plot_k3

plot_k2

plot_k4

#calculate the optimal number of k's
# Elbow method
fviz_nbclust(penguins.clean[,3:6], kmeans, method = "wss")

# Silhouette
fviz_nbclust(penguins.clean[,3:6], kmeans, method = "silhouette")

\[ \\[1cm] \]

5.4. Hierarchical clustering

Hierarchical clustering is a technique that can be divided into two categories:

  1. Agglomerative hierarchical clustering (commonly used)

Here, iniatially data points are considered an individual cluster, and at each iteration, similar clusters (data points) are merged together with other similar clusters until one cluster remains.

  1. Divisive hierarchical clustering

Here, this technique considers all the data points a single cluster, and in each iteration, data points are seperated from the cluster, which are not similar. This is a procedure that continues till we are left with n number of clusters.

To visualize hierarchical clustering, we use dendrograms, which are tree-like diagrams. In these, the sequences of merges or splits is recorded.

#prepare our dataset, removing NA values
penguins <- penguins |> filter(!is.na(bill_length_mm), !is.na(bill_depth_mm), !is.na(flipper_length_mm), !is.na(body_mass_g))

#extract the trait( variable/ -s), which will be used for te hierarchical clustering
pingu <- penguins |> select(bill_length_mm, body_mass_g)

##scale data and compute a distance matrix for them
pingu_scaled.distMatrix <- dist(scale(pingu))

#we can use different linkage types
complete.hclust <- hclust(pingu_scaled.distMatrix, method = "complete")
average.hclust <- hclust(pingu_scaled.distMatrix, method = "average")

plot(complete.hclust)

plot(average.hclust, labels = penguins$species, cex =0.2)

\[ \\[1cm] \]

5.5. Visualisation on a heatmap

Heatmaps are graphical representations of data, which use a color-code to represent their different values. The variation of colors is analogous to the intensity of the data values, and they are used to visually inform the reader about a phenomenon.
They are commonly used in biomedical research.

expr.data <- read.table("RNAseqData.csv", sep = ",", header = T)
rownames(expr.data) <- expr.data[,1]

expr.data <- as.matrix(expr.data[,2:ncol(expr.data)])
heatmap(expr.data, scale = "none")

heatmap(expr.data, scale = "row")

library(pheatmap)
library(RColorBrewer)
pheatmap(expr.data, fontsize_row = 5)

pheatmap(expr.data, color = colorRampPalette(rev(brewer.pal(n=10, name="RdBu"))) (100), 
         border_color = "grey60", scale="row", 
         cellwidth = 20, cellheight = 5, 
         cluster_rows = TRUE, cluster_cols = TRUE, 
         show_colnames = TRUE, show_rownames = TRUE,
         fontsize_row = 5, fontsize_col = 10, fontsize = 10, na_col= "black", legend = TRUE)

Exercise 12

On the penguin dataset:

  1. Create a dataset with only Adelie penguins, and keeping only body mass, flipper length and bill length

  2. Run a PCA on this dataset and visualise the penguins and variables on the first 2 axis

  3. Using the method of your choice, determine the optimal number of clusters

  4. Using k-means, cluster the Adelie penguins

  5. Perform adequate tests to compare the body mass and flipper length between your groups.

  6. Draw boxplots illustrating the previous comparisons.